Abstract

In this paper we describe how the trans- 
lation methodology adopted for the Spo- 
ken Language Translator (SLT) addresses 
the characteristics of the speech transla- 
tion task in a context where it is essential 
to achieve easy customization to new lan- 
guages and new domains. We then discuss 
the issues that arise in any attempt to eval- 
uate a speech translator, and present the 
results of such an evaluation carried out on 
SLT for several language pairs. 



1 The nature of the speech
translation task



Speech translation is in many respects a particularly
difficult version of the translation task. High quality
output is essential: the speech produced must sound
natural if it is to be easily comprehensible. The
quality of the translation itself must also be high, in
spite of the fact that, by the nature of the problem,
no post-editing is possible. Things are equally difficult
on the input side: pre-editing, too, is difficult
or impossible, yet ill-formed input and recognition
errors are both likely to be quite common. Thus robust
analysis and translation are also required. Furthermore,
any attempted solutions to these problems
must be capable of operating at a speed close enough
to real time that users are not faced with unacceptable
delays.

Together, these factors mean that speech translation
is currently only practical for limited domains,
typically involving a vocabulary of a few thousand
words. Because of this, it is desirable that a speech
translator should be easily portable to new domains.
Portability to new languages, involving the acquisition
of both monolingual and cross-linguistic information,
should also be as straightforward as possible.
These ends can be achieved by using general-purpose
components for both speech and language
processing and training them on domain-specific
speech and text corpora. The training should be automated
whenever possible, and where human intervention
is required, the process should be deskilled
to the level where, ideally, it can be carried out by
people who are familiar with the domain but are not
experts in the systems themselves.

These points will be discussed in the context of
the Spoken Language Translator (SLT) (Rayner, Alshawi
et al., 1993; Agnas et al., 199^; Rayner and
Carter, 1997), a customizable speech translator built
as a pipelined sequence of general-purpose components.
These components are: a version of the Decipher (TM)
speech recognizer (Murveit et al., 1993)
for the source language; a copy of the Core Language
Engine (CLE) (Alshawi (ed), 1992) for the source
language; another copy of the CLE for the target
language; and a target language text-to-speech synthesizer.

The current SLT system carries out multi-lingual
speech translation in near real time in the ATIS domain
(Hemphill et al., 1990) for several language
pairs. Good demonstration versions exist for the
four pairs English → Swedish, English → French,
Swedish → English and Swedish → Danish. Preliminary
versions exist for five more pairs: Swedish
→ French, French → English, English → Danish,
French → Spanish and English → Spanish.

We describe the methodology used to build the
SLT system itself, particularly in the areas of customization
(Section 2), robustness (Section 3), and
multilinguality (Section 4). For further details on
the topics of customization and multilinguality, see
(Rayner, Bretan et al., 1996; Rayner, Carter et al.,
1997); and on robustness, see (Rayner and Carter,
1997). We then discuss the evaluation of speech
translation systems. This is an area that deserves
more attention than it has received to date; indeed,
it is not obvious how best to perform such an evaluation
so as to measure meaningfully the performance
both of the overall system and of each of its
components. In Sections 5 and 6 of this paper, we
therefore consider the characteristics an evaluation
should have, and describe one we have carried out,
discussing the extent to which it meets the desired
criteria.

2 Customization to languages and 
domains 

In the Core Language Engine, the language processing
component of the Spoken Language Translator
system, we address the requirement of portability
by maintaining a clear separation between (1) the
system code; (2) linguistic rules, including lexicon
entries, to generate possible analyses and translations
non-deterministically; and (3) statistical information,
to choose between these possibilities. The
practical advantage of this architecture is that most
of the work involved in porting the system to a new
domain is concerned with the parts of the system
that can be modified by non-experts: the central activities
are addition of new lexicon entries, and supervised
training to derive the statistical preference
information. Porting to new languages is a more
complex task, but still only involves modifications
to a relatively small subset of the whole system. In
more detail:

(1) The system code is completely general-purpose
and does not need any changes for new domains or,
other than in exceptional cases,¹ for new languages.

¹E.g. in our initial extension from English to languages
with more complicated morphology, which necessitated
the development of a morphological processor
based on the two-level formalism (see (Carter, 1995)).

(2) The more complex of the linguistic rules for a
given language are the grammar, the function word
lexicon, and the macros defining common content
word behaviours (count noun, transitive verb, etc).
These are defined using explicit feature-value equations
which must be written by a skilled grammarian.
For a given language pair, the more complex
transfer rules, which tend to be for function words
and other commonly-occurring, idiosyncratic words,
can also involve arbitrarily large, recursive structures.
However, nearly all of these monolingual and
bilingual rules are domain-independent.

On the other side of the coin, the main domain-dependent
aspects of a linguistic description are
lexicon entries defining content words in terms of
existing behaviours, and simple (atomic-to-atomic)
transfer rules. These do need to be created manually
for each new domain, but they are simple enough to
be defined by non-experts with the help of relatively
simple graphical tools. See Figures 1 and 2 for some
examples of these two kinds of rule (the details of
the formalism are unimportant here; we intend simply
to illustrate the differences in complexity).

When moving to a new language, more expert intervention
is typically required than for a new domain,
because many of the complex rules do need
some modifications. However, we have found that
the amount of work involved in developing new
grammars for Swedish, French, Spanish and most
recently Danish has always been at least an order of
magnitude less than the effort required for the original
grammar (Gamback and Rayner, 1992; Rayner,
Carter and Bouillon, 1996; Rayner, Carter et al.,
1997).

(3) The statistical information used in analysis is
entirely derived from the results of supervised training
on corpora carried out using the TreeBanker
(Carter, 1997), a graphical tool that presents a non-expert
user with a display of the salient differences
between alternative analyses in order that the correct
one may be identified. Once a user has become
accustomed to the system, around two hundred sentences
per hour may be processed in this way. This,
together with the use of representative subcorpora
(Rayner, Bouillon and Carter, 1995) to allow structurally
equivalent sentences to be represented by a
single example, means that a corpus of many thousands
of sentences can be judged in just a few person
weeks. The principal information extracted automatically
from a judged corpus is:

• Constituent pruning rules, which allow the detection
and removal, at intermediate stages
of parsing, of syntactic constituents occurring
in contexts where they are unlikely to
contribute to the correct parse. Removing
these constituents significantly constrains the
search space and speeds up parsing (Rayner and
Carter, 1997).

• An automatic tuning of the grammar to the
domain using the technique of Explanation-Based
Learning (van Harmelen and Bundy,
1988; Rayner, 1988; Samuelsson and Rayner,
1991; Rayner and Carter, 1996). This rewrites
it to a form where only commonly-occurring
rule combinations are represented, thus reducing
the search space still further and giving an
additional significant speedup.

• Preference information attached to certain characteristics
of full analyses of sentences - the
most important being semantic triples of head,
relationship and modifier - which allow a selection
to be made between competing full analyses.
See (Alshawi and Carter, 1994) and
(Carter, 1997) for details.

A similar mechanism has been developed to allow
users to specify appropriate translations, giving rise
to preferences on outcomes of the transfer process.
Work on this continues.

Syntax rule for S → NP VP:

syn(s_np_vp_Normal, core,
    [s:[@s_np_feats(MMM), @vp_feats(MM),
        sententialsubj=SS, sai=Aux, hascomp=n, conjoined=n],
     np:[@s_np_feats(MMM), vform=(fin\/to), relational=_, temporal=_, agr=Ag,
         sentential=SS, wh=_, whmoved=_, pron=_, nform=Sfm],
     vp:[@vp_feats(MM), vform=(\(en)), agr=Ag, sai=Aux, modifiable=_,
         mainv=_, headfinal=_, subjform=Sfm]]).

Macro definition for syntax of transitive verb:

macro(v_subj_obj,
      [v:[vform=base, mhdf1=A, passive=A, gaps=B, conjoined=n,
          subcat=[np:[relational=_, passive=A, wh=_, gap=_, gaps=B,
                      temporal=_, pron=_, case=nonsubj]]]]).

Transfer rule relating English adjective "early" and French PP "de bonne heure":

trule([eng,fre], semi_lex(early-de_bonne_heure),
      [early_NotLate, tr(arg)]

      @form(prep('de bonne heure_Early'), _,
            P^[P, tr(arg),
               @term(ref(pro, de_bonne_heure, sing, _),
                     V, W^[time,W])+_])).

Figure 1: Complex, domain-independent linguistic rules

3 Robustness

Robustness in the face of ill-formed input and recognition
errors is tackled by means of a "multi-engine"
strategy (Frederking and Nirenburg, 1994; Rayner
and Carter, 1997), combining two different translation
methods. The main translation method uses
transfer at the level of QLF (Alshawi et al., 1991;
Rayner and Bouillon, 1995); this is supplemented by
a simpler, glossary-based translation method. Processing
is carried out bottom-up. Roughly speaking,
the QLF transfer method is used to translate as
much as possible of the input utterance, any remaining
gaps being filled by application of the glossary-based
method.

Lexicon entry, using transitive verb macro, for "serve" as in "Does Continental serve Atlanta?":

lr(serve, v_subj_obj, serve_FlyTo).

Transfer rule relating that sense of "serve" to one sense of French "desservir":

trule([eng,fre], lex(simple), serve_FlyTo==desservir_ServeCity).

Figure 2: Simple, domain-dependent linguistic rules

In more detail, source-language parsing goes
through successive stages of lexical (morphological)
analysis, low-level phrasal parsing to identify constituents
such as simple noun phrases, and finally
full sentential parsing using a version of the original
grammar tuned to the domain using explanation-based
learning (see Section 2 above). Parsing is carried
out in a bottom-up mode. After each parsing
stage, a corresponding translation operation takes
place on the resulting constituent lattice. Translation
is performed by using the glossary-based
method at the early stages of processing, before
parsing is initiated, and by using the QLF-transfer
method during and after parsing. Each successful
transfer attempt results in a target language string
being added to a target-side lattice. Metrics are
then applied to choose a path through this lattice.
The criteria used to select the path involve preferences
for sequences that have been encountered in a
target-language corpus; for the use of more sophisticated
transfer methods over less sophisticated; and
for larger over smaller chunks.
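
The choice of a path through the target-side lattice can be illustrated with a small sketch. It is not the SLT implementation: the `Edge` class, the three bonus weights, the simple additive scoring and the example glossary entries are all assumptions made for illustration. Only the three kinds of preference (corpus-attested sequences, more sophisticated methods, larger chunks) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    start: int    # index of the first source word covered
    end: int      # index one past the last source word covered
    target: str   # target-language string for this chunk
    method: str   # "qlf" or "glossary"


# Hypothetical weights, chosen only to make the example concrete.
METHOD_BONUS = {"qlf": 2.0, "glossary": 0.0}
CHUNK_BONUS = 1.0   # reward per extra source word covered by one edge
SEEN_BONUS = 1.5    # reward if the string is attested in a target corpus


def edge_score(edge, seen_targets):
    score = METHOD_BONUS[edge.method]
    score += CHUNK_BONUS * (edge.end - edge.start - 1)
    if edge.target in seen_targets:
        score += SEEN_BONUS
    return score


def best_path(edges, n_words, seen_targets=frozenset()):
    """Return the best-scoring edge sequence covering words 0..n_words, or None."""
    best = {0: (0.0, [])}  # position -> (score, edges used so far)
    for pos in range(n_words):
        if pos not in best:
            continue
        base_score, base_path = best[pos]
        for edge in edges:
            if edge.start != pos:
                continue
            cand = (base_score + edge_score(edge, seen_targets), base_path + [edge])
            if edge.end not in best or cand[0] > best[edge.end][0]:
                best[edge.end] = cand
    return best[n_words][1] if n_words in best else None


# Source words: show me an early flight (glossary entries are invented).
EDGES = [
    Edge(0, 1, "montrez", "glossary"),
    Edge(1, 2, "moi", "glossary"),
    Edge(2, 3, "un", "glossary"),
    Edge(3, 4, "matinal", "glossary"),
    Edge(4, 5, "vol", "glossary"),
    Edge(2, 5, "un vol de bonne heure", "qlf"),
]
path = best_path(EDGES, n_words=5)
print(" ".join(edge.target for edge in path))
```

With these weights the QLF-derived noun phrase displaces the three word-for-word edges it overlaps, while the glossary method still fills the part of the utterance that has no QLF translation.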

The bottom-up approach contributes to robust- 
ness in the obvious way: if a single analysis can- 
not be found for the whole utterance, then transla- 
tions can be produced for partial analyses that have 
already been found. It also contributes to system 
response in that the earlier, more local, shallower 
methods of analysis and transfer usually operate 
very quickly to produce an attempt at translation. 
The target-language user may interrupt processing 
before the more global methods have finished if the 
translation (assuming it can be viewed on a screen) 
is adequate, or the system itself may abandon a sen- 
tence, and present its current best translation, if a 
specified time has elapsed. 

Figure 3 exemplifies the operation of the multi-engine
strategy as well as of the preferences applied
to analysis and transfer.² The N-best list delivered
by the speech recognizer contains the sentence actually
uttered, "Could you show me an early flight
please?", but only in fourth position.

²The example chosen was the most interesting of
the dozen or so in our most recent demonstration session,
and the intermediate results have been reproduced
from the system log file without any changes other than
reformatting.

• Before any linguistic processing is carried out,
the word sequence at the top of the N-best list
is the most preferred one, as only recognition
preferences (shown by position in the list) are
available. This sequence is translated word-for-word
using the glossary method, giving result
(a) in the figure.

• After lexical analysis, which effectively includes
part-of-speech tagging, it is determined that the
word "a" is unlikely to precede "are", and so "a"
is dropped from the translated sequence (b),
thus translating recognizer hypothesis 2, using
the glossary-based method.

• Phrasal parsing identifies "an early flight" as
a likely noun phrase, so that this is for the
first time selected for translation, in (c). Note
that the system has now settled on the correct
English word sequence. QLF-based transfer is
used for the first time, and the transfer rule
in Figure 1 is used to translate "early" as "de
bonne heure" which, because it is a PP, is placed
after "vol" (flight) by the French grammar.

• Finally, as shown in (d), an analysis and a QLF- 
based translation are found for the whole sen- 
tence, allowing the inadequate word-for-word 
translation of "could you show me" as "*pour- 
riez vous montrez moi" to be improved to a 
more grammatical "pourriez-vous m'indiquer" . 

We thus see the results of translation becoming 
steadily more accurate and comprehensible as pro- 
cessing proceeds. 

4 Multilinguality, interlinguas and 
the "N-squared problem" 

While using an interlingual representation would
seem to be the obvious way to avoid the "N-squared
problem" (translating between N languages involves
order N² transfer pairs), we are sceptical about
interlinguas for the following reasons.

Firstly, doing good translation is a mixture of two 
tasks: semantics (getting the meaning right) and col- 
location (getting the appearance of the translation 
right). Defining an interlingua, even if it is possible 
to do so for an increasing number N of languages, 
really only addresses the first task. Interlingual
representations also tend to be less portable to new
domains, since if they are to be truly interlingual
they normally need to be based on domain concepts,
which have to be redefined for each new domain - 
a task that involves considerable human interven- 
tion, much of it at an expert level. In contrast, a 
transfer-based representation can be shallower (at 
the level of linguistic predicates) while still abstract- 
ing far enough away from surface form to make most 
of the transfer rules simple atomic substitutions. 

Secondly, systems based on formal representations
are brittle: a fully interlingual system first
needs to translate its input into a formal representation,
and then realise the representation as a
target-language string. An interlingual system is
thus inherently more brittle than a transfer system,
which can produce an output without ever identifying
a "deep" formal representation of the input. For
these reasons, we prefer to stay with a fundamentally
transfer-based methodology; none the less, we
include some aspects of the interlingual approach, by
regularizing the intermediate QLF representation to
make it as language-independent as possible consonant
with the requirement that it also be independent
of domain. Regularizing the representation has
the positive effect of making the transfer rules simpler
(in the limiting case, a fully interlingual system,
they become trivial).

N-best list (N=5) delivered by speech recognizer:

  1  could you show me a are the flight please
  2  could you show me are the flight please
  3  could you show me in order a flight please
  4  could you show me an early flight please
  5  could you show meals are the flight please

(a) Selected input sequence and translation after surface phase:

  could    | you  | show    | me  | a  | are  | the | flight | please
  pourriez | vous | montrez | moi | un | sont | les | vol    | s'il vous plait

(b) Selected input sequence and translation after lexical phase:

  could    | you  | show    | me  | are  | the | flight | please
  pourriez | vous | montrez | moi | sont | les | vol    | s'il vous plait

(c) Selected input sequence and translation after phrasal phase:

  could    | you  | show    | me  | an early flight       | please
  pourriez | vous | montrez | moi | un vol de bonne heure | s'il vous plait

(d) Selected input sequence and translation after full parsing phase:

  could you show me an early flight please
  pourriez-vous m'indiquer un vol de bonne heure s'il vous plait

Figure 3: N-best list and translation results for "Could you show me an early flight please?"

We tackle the N-squared problem by means of
transfer composition (Rayner, Carter and Bouillon,
1996; Rayner, Carter et al., 1997). If we already have
transfer rules for mapping from language A to language
B and from language B to language C, we can
compose them to generate a set to translate directly
from A to C. The first stage of this composition
can be done automatically, and then the results can
be manually adjusted by adding new rules and by
introducing declarations to disallow the creation of
implausible rules: these typically arise because the
contexts in which α ∈ A can correctly be translated
to β ∈ B are disjoint from those in which β can
be translated into γ ∈ C. As with the other customization
tasks described here, the amount of human
intervention required to adjust a composed set
of transfer rules is vastly less, and less specialized,
than what would be required to write them from
scratch.
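
For the simple atomic-to-atomic case, the automatic first stage of composition can be pictured as a join on the pivot language. The sketch below is a simplified illustration under stated assumptions, not the SLT algorithm: rules are reduced to pairs of predicate names, the `disallowed` set stands in for the manual declarations, and every predicate name except `serve_FlyTo` and `desservir_ServeCity` (from Figure 2) is invented.

```python
def compose(rules_ab, rules_bc, disallowed=frozenset()):
    """Compose atomic A->B and B->C rules into candidate A->C rules.

    Each rule is a (source_predicate, target_predicate) pair. Pairs listed
    in `disallowed` play the role of manual declarations that block
    implausible composed rules.
    """
    by_pivot = {}
    for b, c in rules_bc:
        by_pivot.setdefault(b, []).append(c)

    composed = []
    for a, b in rules_ab:
        for c in by_pivot.get(b, []):
            rule = (a, c)
            if rule not in disallowed and rule not in composed:
                composed.append(rule)
    return composed


# Hypothetical Swedish->English and English->French rule sets.
sv_en = [
    ("trafikera_FlygaTill", "serve_FlyTo"),
    ("flyg_Flight", "flight_AirplaneTrip"),
    ("tur_Trip", "flight_AirplaneTrip"),
]
en_fr = [
    ("serve_FlyTo", "desservir_ServeCity"),
    ("flight_AirplaneTrip", "vol_VoyageEnAvion"),
]

automatic = compose(sv_en, en_fr)
adjusted = compose(sv_en, en_fr, disallowed={("tur_Trip", "vol_VoyageEnAvion")})
print(len(automatic), len(adjusted))
```

The automatic pass over-generates whenever two source predicates share a pivot predicate; the declaration removes the unwanted pair without anyone having to write the Swedish-to-French rules by hand.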

In the current version of SLT, transfer rules were 
written directly for neighbouring languages in the 
sequence Spanish - French - English - Swedish - 
Danish (most of these neighbours being relatively 
closely related), with other pairs being derived by 
transfer composition. Further details can be found
in (Rayner, Carter et al., 1997).



5 Evaluation of speech translation 
systems: methodological issues 

There is still no real consensus on how to evaluate 
speech translation systems. The most common ap- 
proach is some version of the following. The system 
is run on a set of previously unseen speech data; the 
results are stored in text form; someone judges them 
as acceptable or unacceptable translations; and fi- 
nally the system's performance is quoted as the pro- 
portion that are acceptable. This is clearly much 
better than nothing, but still contains some serious 
methodological problems. In particular: 



1. There is poor agreement on what constitutes 
an "acceptable translation". Some judges re- 
gard a translation as unacceptable if a single 
word-choice is suboptimal. At the other end 
of the scale, there are judges who will accept 
any translation which conveys the approximate 
meaning of the sentence, irrespective of how 
many grammatical or stylistic mistakes it con- 
tains. Without specifying more closely what is 
meant by "acceptable" , it is difficult to compare 
evaluations. 

2. Speech translation is normally an interactive 
process, and it is natural that it should be less 
than completely automatic. At a minimum, it is 
clearly reasonable in many contexts to feed back 
to the source-language user the words the rec- 
ognizer believed it heard, and permit them to 
abort translation if recognition was unaccept- 
ably bad. Evaluation should take account of 
this possibility. 

3. Evaluating a speech-to-speech system as though 
it were a speech-to-text system introduces a cer- 
tain measure of distortion. Speech and text are 
in some ways very different media: a poorly 
translated sentence in written form can nor- 
mally be re-examined several times if necessary, 
but a spoken utterance may only be heard once. 
In this respect, speech output places heavier de- 
mands on translation quality. On the other 
hand, it can also be the case that construc- 
tions which would be regarded as unacceptably 
sloppy in written text pass unnoticed in speech. 

We are in the process of redesigning our translation
evaluation methodology to take account of all of
the above points. Currently, most of our empirical
work still treats the system as though it produced
text output; we describe this mode of evaluation in
Section 5.1. A novel method which evaluates the
system's actual spoken output is currently undergoing
initial testing, and is described in Section 5.2.
Section 6 presents results of experiments using both
evaluation methods.

5.1 Evaluation of speech to text translation 

In speech-to-text mode, evaluation of the system's 
performance on a given utterance proceeds as fol- 
lows. The judge is first shown a text version of 
the correct source utterance (what the user actually 
said), followed by the selected recognition hypoth- 
esis (what the system thought the user said). The 
judge is then asked to decide whether the recognition
hypothesis is acceptable. Judges are told to assume
that they have the option of aborting translation
if recognition is of insufficient quality; judging a
recognition hypothesis as unacceptable corresponds 
to pushing the 'abort' button. 

When the judge has determined the acceptabil- 
ity of the recognition hypothesis, the text version of 
the translation is presented. (Note that it is not 
presented earlier, as this might bias the decision 
about recognition acceptability.) The judge is now 
asked to classify the quality of the translation along 
a seven-point scale; the points on the scale have 
been chosen to reflect the distinctions judges most 
frequently have been observed to make in practice. 
When selecting the appropriate category, judges are 
instructed only to take into account the actual spo- 
ken source utterance and the translation produced, 
and ignore the recognition hypothesis. The possible 
judgement categories are the following; the headings
are those used in Tables 1 and 2 below.

Fully acceptable. Fully acceptable translation. 

Unnatural style. Fully acceptable, except that 
style is not completely natural. This is most 
commonly due to over-literal translation. 

Minor syntactic errors. One or two minor syn- 
tactic or word-choice errors, otherwise accept- 
able. Typical examples are bad choices of de- 
terminers or prepositions. 

Major syntactic errors. At least one major or 
several minor syntactic or word-choice errors, 
but the sense of the utterance is preserved. The 
most common example is an error in word-order 
produced when the system is forced to back up 
to the robust translation method. 

Partial translation. At least half of the utterance 
has been acceptably translated, and the rest is 
nonsense. A typical example is when most of 
the utterance has been correctly recognized and 
translated, but there is a short 'false start' at 
the beginning which has resulted in a word or 
two of junk at the start of the translation. 

Nonsense. The translation makes no sense. The 
most common reason is gross misrecognition, 
but translation problems can sometimes be the 
cause as well. 

Bad translation. The translation makes some 
sense, but fails to convey the sense of the source 
utterance. The most common reason is again a 
serious recognition error. 



Results are presented by simply counting the number
of translations in a run which fall into each category.
By taking account of the "unacceptable hypothesis"
judgements, it is possible to evaluate the
performance of the system either in a fully automatic
mode, or in a mode where the source-language user
has the option of aborting misrecognized utterances.
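
The two reporting modes can be sketched as follows. This is an illustrative tally only: the representation of a judgement as a pair (recognition acceptable?, translation category) and the "Aborted" label are assumptions, and the example judgements are invented rather than taken from the experiments.

```python
from collections import Counter


def tally(judgements, allow_abort=False):
    """Count translations per category.

    Each judgement is (recognition_ok, category). With allow_abort=True,
    utterances whose recognition hypothesis was judged unacceptable are
    counted as aborted instead of under their translation category.
    """
    counts = Counter()
    for recognition_ok, category in judgements:
        if allow_abort and not recognition_ok:
            counts["Aborted"] += 1
        else:
            counts[category] += 1
    return counts


# Invented judgements, for illustration only.
judgements = [
    (True, "Fully acceptable"),
    (True, "Unnatural style"),
    (False, "Nonsense"),
    (False, "Fully acceptable"),
    (True, "Partial translation"),
]

fully_automatic = tally(judgements)
with_abort = tally(judgements, allow_abort=True)
print(fully_automatic)
print(with_abort)
```

Because the same set of judgements feeds both tallies, one judging pass is enough to report the system in either mode.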

5.2 Evaluation of speech to speech 
translation 

Our intuitive impression, based on many evalua- 
tion runs in several different language-pairs, is that 
the "fine-grained" style of speech-to-text evaluation 
described in the preceding section gives a much 
more informative picture of the system's perfor- 
mance than the simple acceptable/unacceptable di- 
chotomy. However, it raises an obvious question: 
how important, in objective terms, are the distinc- 
tions drawn by the fine-grained scale? The prelim- 
inary work we now go on to describe attempts to 
provide an empirically justifiable answer, in terms 
of the relationship between translation quality and 
comprehensibility of output speech. Our goal, in 
other words, is to measure objectively the ability of 
subjects to understand the content of speech out- 
put. This must be the key criterion for evaluating 
a candidate translation: if apparent deficiencies in
syntax or word-choice fail to affect subjects' ability
to understand content, then it is hard to say that
they represent real loss of quality.

The programme sketched above is difficult or, ar- 
guably, impossible to implement in a general setting. 
In a limited domain, however, it appears quite feasi- 
ble to construct a domain-specific form-based ques- 
tionnaire designed to test a subject's understanding 
of a given utterance. In the SLT system's current 
domain of air travel planning (ATIS), a simple form 
containing about 20 questions extracts enough con- 
tent from most utterances that it can be used as a 
reliable measure of a subject's understanding. The 
assumption is that a normal domain utterance can 
be regarded as a database query involving a limited 
number of possible categories: in the ATIS domain, 
these are concepts like flight origin and destination, 
departure and arrival times, choice of airline, and so 
on. A detailed description of the evaluation method 
follows. 

The judging interface is structured as a hyper- 
text document that can be accessed through a web- 
browser. Each utterance is represented by one web 
page. On entering the page for a given utterance, 
the judge first clicks a button that plays an audio 
file, and then fills in an HTML form describing what 
they heard. Judges are allowed to start by writing
down as much as they can of the utterance, so as to
keep it clear in their memory as they fill in the form.

The form is divided into four major sections. The 
first deals with the linguistic form of the enquiry, 
for example, whether it is a command (imperative), 
a yes/no-question or a wh-question. In the second 
section the judge is asked to write down the princi- 
pal "object" of the utterance. For example, in the 
utterance "Show flights from Boston to Atlanta", 
the principal object would be "flights". The third 
section lists some 15 constraints on the object ex- 
plicitly mentioned in the enquiry, like ". . . one-way 
from New York to Boston on Sunday" . Initial test- 
ing proved that these three sections covered the form 
and content of most enquiries within the domain, 
but to account for unforeseen material the judge is 
also presented with a "miscellaneous" category. De- 
pending on the character of the options, form en- 
tries are either multiple-choice or free-text. All form 
entries may be negated ("No stopovers") and dis- 
junctive enquiries are indicated by dint of index- 
ing ("Delta on Thursday or American on Friday"). 
When the page is exited, the contents of the com- 
pleted form are stored for further use. 

Each translated utterance is judged in three versions,
by different judges. The first two versions are
the source and target speech files; the third time, the 
form is filled in from the text version of the source 
utterance. (The judging tool allows a mode in which 
the text version is displayed instead of an audio file 
being played.) The intention is that the source text 
version of the utterance should act as a baseline with 
which the source and target speech versions can re- 
spectively be compared. Comparison is carried out 
by a fourth judge. Here, the contents of the form en- 
tries for two versions of the utterance are compared. 
The judge has to decide whether the contents of each 
field in the form are compatible between the two ver- 
sions. 

When the forms for two versions of an utterance 
have been filled in and compared, the results can 
be examined for comprehensibility in terms of the 
standard notions of precision and recall. We say 
that the recall of version 2 of the utterance with 
respect to version 1 is the proportion of the fields 
filled in version 1 that are filled in compatibly in 
version 2. Conversely, the precision is the proportion 
of the fields filled in in version 2 that are filled in 
compatibly in version 1. 

The recall and precision scores together define a
two-element vector which we will call the comprehensibility
of version 2 with respect to version 1.
We can now define C_source to be the comprehensibility
of the source speech with respect to the source
text, and C_target to be the comprehensibility of the
target speech with respect to the source text. Finally,
we define the quality of the translation to be
1 - (C_source - C_target), where C_source - C_target in
a natural way can be interpreted as the extent to
which comprehensibility has degraded as a result of
the translation process. At the end of the following
section, we describe an experiment in which we use
this measure to evaluate the quality of translation
in the English → French version of SLT.
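
A minimal sketch of these definitions follows, under several assumptions: a filled-in form is modelled as a dictionary from field name to value, the fourth judge's compatibility decision is approximated by simple equality, quality is taken to be 1 - (C_source - C_target) applied component-wise, and a form with no filled fields scores 1.0. The field names and values are invented.

```python
def comprehensibility(reference, version, compatible=None):
    """Return (recall, precision) of `version` with respect to `reference`."""
    if compatible is None:
        # Stand-in for the human compatibility judgement.
        compatible = lambda a, b: a == b
    ref_filled = {field for field, value in reference.items() if value}
    ver_filled = {field for field, value in version.items() if value}
    agree = {
        field for field in ref_filled & ver_filled
        if compatible(reference[field], version[field])
    }
    recall = len(agree) / len(ref_filled) if ref_filled else 1.0
    precision = len(agree) / len(ver_filled) if ver_filled else 1.0
    return recall, precision


def quality(c_source, c_target):
    """Component-wise 1 - (C_source - C_target)."""
    return tuple(1 - (s - t) for s, t in zip(c_source, c_target))


# Invented forms for one utterance.
source_text = {"object": "flights", "origin": "Boston",
               "destination": "Atlanta", "day": "Sunday"}
source_speech = dict(source_text)
target_speech = {"object": "flights", "origin": "Boston",
                 "destination": "Atlanta", "airline": "Delta"}

c_source = comprehensibility(source_text, source_speech)
c_target = comprehensibility(source_text, target_speech)
print(c_source, c_target, quality(c_source, c_target))
```

In this example the target speech loses the day and gains a spurious airline, so both recall and precision fall from 1.0 to 0.75, and the quality vector is (0.75, 0.75).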

6 An evaluation of the Spoken 
Language Translator 

We begin by presenting the results of tests run in 
speech-to-text mode on versions of the SLT system 
developed for six different language-pairs: English
→ Swedish, English → French, Swedish → English,
Swedish → French, Swedish → Danish, and English
→ Danish. Before going any further, it must be
stressed that the various versions of the system differ 
in important ways; some language-pairs are intrinsi- 
cally much easier than others, and some versions of 
the system have received far more effort than others. 

In terms of difficulty, Swedish → Danish is clearly
the easiest language-pair, and Swedish → French is
clearly the hardest. English → French is easier than
Swedish → French, but substantially more difficult
than any of the others. English → Swedish, Swedish
→ English and English → Danish are all of comparable
difficulty. We present approximate figures for
the amounts of effort devoted to each language pair 
in conjunction with the other results. 

We evaluated performance on each language-pair 
in the manner described in Section 5.1 above, taking
as input two sets of 200 recorded speech utterances
each (one for English and one for Swedish)
which had not previously been used for system de- 
velopment. Judging was done by subjects who had 
not participated in system development, were native 
speakers of the target language, and were fluent in 
the source language. Results are presented both for 
a fully automatic version of the system (Table 1),
and for a version with a simulated 'abort' button
(Table 2).

Finally, we turn to a preliminary experiment
which used the speech-to-speech evaluation methodology
from Section 5.2 above. A set of 200 previously
unseen English utterances was translated
by the system into French speech, using the same
kind of subjects as in the previous experiments.
Source-language and target-language speech was
synthesized using commercially available,
state-of-the-art synthesizers (TrueTalk from Entropic and
CNETVOX from ELAN Informatique, respectively).
The subjects were only allowed to hear each utterance
once. The results were evaluated in the manner
described, to produce figures for comprehensibility
of source and target speech respectively. The figures
are presented in Table 3; we expect to be able to
present a more detailed discussion of their significance
by the time of the workshop.

In summary, we have improved the standard eval- 
uation method for speech translation by developing 
a feasible alternative with a more fine-grained tax- 
onomy of acceptability. In order to make the task 
of evaluation more realistic, we have also created a 
method in which instead of textual translations it is 
the spoken form that is judged. This method is cur- 
rently in embryonic form, but the pilot experiment 
described here leads us to think that the method 
shows promise for further development. 

An interesting future task would be to investigate 
the significance of various kinds of written-language 
translation errors in terms of reducing comprehen- 
sibility of the spoken output. This would amount 
to systematically comparing $C_{target}$ with results
obtained in speech-to-text evaluations, divided up
according to error categories such as those in our
taxonomy.

Table 1: Translation results for six language pairs on 200 unseen utterances, all utterances in test set
counted. Note that in both tables on this page, the "effort" figures refer specifically to translation work for
the language pair in question, and exclude work on grammar and lexicon development for the individual
languages.



Source language          English   English   Swedish   Swedish   Swedish   English
Target language          Swedish   French    English   French    Danish    Danish
Effort (person-months)   8-10      3-5       3-5       1-2       0.5       <0.5
Fully acceptable         46.0%     52.0%     45.0%     19.0%     36.5%     27.0%
Unnatural style          14.0%     10.5%     4.5%      15.0%     0.0%      0.0%
Minor syntactic errors   12.0%     3.5%      12.0%     13.0%     37.5%     28.0%
Clearly useful           72.0%     66.0%     61.5%     47.0%     74.0%     55.0%
Major syntactic errors   7.0%      2.5%      7.5%      13.0%     0.0%      0.0%
Partial translation      6.5%      11.5%     14.5%     17.5%     1.5%      1.5%
Borderline               13.5%     14.0%     22.0%     30.5%     1.5%      1.5%
Nonsense                 7.5%      13.0%     10.5%     18.0%     13.0%     30.5%
Bad translation          5.0%      5.5%      4.5%      3.5%      9.0%      10.5%
No translation           2.0%      1.5%      1.5%      1.0%      2.5%      2.5%
Clearly useless          14.5%     20.0%     16.5%     22.5%     24.5%     43.5%



Table 2: Translation results for six language pairs on 200 unseen utterances, ignoring utterances judged as 
recognition failures. 



Source language          English   English   Swedish   Swedish   Swedish   English
Target language          Swedish   French    English   French    Danish    Danish
Effort (person-months)   8-10      3-5       3-5       1-2       0.5       <0.5
Fully acceptable         55.8%     65.8%     60.7%     23.1%     49.0%     35.9%
Unnatural style          15.8%     12.9%     6.4%      19.2%     0.0%      0.0%
Minor syntactic errors   12.1%     3.2%      11.4%     15.4%     38.1%     35.9%
Clearly useful           83.7%     81.9%     78.5%     57.7%     87.1%     71.8%
Major syntactic errors   7.9%      2.6%      10.0%     12.8%     0.0%      0.0%
Partial translation      2.4%      5.8%      5.0%      14.1%     0.7%      2.1%
Borderline               10.3%     8.4%      15.0%     26.9%     0.7%      2.1%
Nonsense                 3.0%      4.5%      2.9%      11.5%     4.8%      12.4%
Bad translation          1.2%      3.2%      2.1%      2.6%      5.4%      11.7%
No translation           1.8%      1.9%      1.4%      1.3%      2.0%      2.1%
Clearly useless          6.0%      9.6%      6.4%      15.4%     12.2%     26.2%
(Utterances ignored)     35        45        60        44        53        55
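
The arithmetic behind Table 2 can be sketched as follows, using the English → Swedish column. Two points are assumptions on our part, inferred from the figures rather than stated in the tables: that each summary row (Clearly useful, Borderline, Clearly useless) is the sum of the fine-grained rows listed above it, and that the percentages are taken over the utterances remaining after the ignored ones are removed from the 200-utterance test set.

```python
TEST_SET_SIZE = 200

# Fine-grained rows of the English -> Swedish column of Table 2 (percent).
column = {
    "Fully acceptable": 55.8,
    "Unnatural style": 15.8,
    "Minor syntactic errors": 12.1,
    "Major syntactic errors": 7.9,
    "Partial translation": 2.4,
    "Nonsense": 3.0,
    "Bad translation": 1.2,
    "No translation": 1.8,
}
ignored = 35  # "(Utterances ignored)" entry for the same column

# Assumed grouping of fine-grained categories into the summary rows.
GROUPS = {
    "Clearly useful": ["Fully acceptable", "Unnatural style",
                       "Minor syntactic errors"],
    "Borderline": ["Major syntactic errors", "Partial translation"],
    "Clearly useless": ["Nonsense", "Bad translation", "No translation"],
}


def summarize(col):
    """Sum the fine-grained percentages within each summary group."""
    return {group: round(sum(col[c] for c in cats), 1)
            for group, cats in GROUPS.items()}


def to_counts(col, n_ignored, total=TEST_SET_SIZE):
    """Convert percentages back to utterance counts over the kept utterances."""
    kept = total - n_ignored
    return {c: round(p / 100 * kept) for c, p in col.items()}


summary = summarize(column)
counts = to_counts(column, ignored)
```

For this column the recovered counts add up to the 165 utterances that were not ignored, and the group sums agree with the three summary rows of the table.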



Table 3: Relative comprehensibility of source and target speech for English → French test on 200 unseen
utterances.

             Source    Target    Difference   Quality
Precision    97.6%     86.0%     11.6%        88.4%
Recall       97.5%     84.0%     13.5%        86.5%



